Audio coding/decoding with spatial parameters and non-uniform segmentation for transients

ABSTRACT

In binaural stereo coding, only one monaural channel is encoded. An additional layer holds the parameters to retrieve the left and right signals. An encoder is disclosed which links transient information extracted from the mono encoded signal to parametric multi-channel layers to provide increased performance. Transient positions can either be derived directly from the bit-stream or be estimated from other encoded parameters (e.g., the window-switching flag in mp3).

FIELD OF THE INVENTION

The present invention relates to audio coding.

BACKGROUND OF THE INVENTION

In traditional waveform-based audio coding schemes such as MPEG-1 Layer II, mp3 and AAC (MPEG-2 Advanced Audio Coding), stereo signals are encoded by encoding two monaural audio signals into one bit-stream. However, by exploiting inter-channel correlation and irrelevancy with techniques such as mid/side stereo coding and intensity coding, bit-rate savings can be made.

In the case of mid/side stereo coding, stereo signals with a high amount of mono content can be split into a sum M=(L+R)/2 and a difference S=(L−R)/2 signal. This decomposition is sometimes combined with principal component analysis or time-varying scale-factors. The signals are then coded independently, either by a parametric coder or a waveform coder (e.g., a transform or subband coder). For certain frequency regions this technique can result in a slightly higher energy for either the M or S signal, while for other frequency regions a significant reduction of energy can be obtained for either the M or S signal. The amount of information reduction achieved by this technique strongly depends on the spatial properties of the source signal. For example, if the source signal is monaural, the difference signal is zero and can be discarded. However, if the correlation of the left and right audio signals is low (which is often the case for the higher frequency regions), this scheme offers only little advantage.

In the case of intensity stereo coding, for a certain frequency region, only one signal I=(L+R)/2 is encoded, along with intensity information for the L and R signals. At the decoder side this signal I is used for both the L and R signals after scaling it with the corresponding intensity information. In this technique, high frequencies (typically above 5 kHz) are represented by a single audio signal (i.e., mono), combined with time-varying and frequency-dependent scale-factors.

Parametric descriptions of audio signals have gained interest in recent years, especially in the field of audio coding. It has been shown that transmitting (quantized) parameters that describe audio signals requires only little transmission capacity to re-synthesize a perceptually equal signal at the receiving end. However, current parametric audio coders focus on coding monaural signals, and stereo signals are often processed as dual mono.

EP-A-1107232 discloses a parametric coding scheme to generate a representation of a stereo audio signal which is composed of a left channel signal and a right channel signal. To efficiently utilize transmission bandwidth, such a representation contains information concerning only a monaural signal, which is either the left channel signal or the right channel signal, and parametric information. The other stereo signal can be recovered based on the monaural signal together with the parametric information. The parametric information comprises localization cues of the stereo audio signal, including intensity and phase characteristics of the left and the right channel.

In binaural stereo coding, similar to intensity stereo coding, only one monaural channel is encoded. Additional side information holds the parameters to retrieve the left and right signals. European Patent Application No. 02076588.9, filed April, 2002, discloses a parametric description of multi-channel audio related to the binaural processing model presented by Breebaart et al. in "Binaural processing model based on contralateral inhibition. I. Model setup", J. Acoust. Soc. Am., 110, 1074-1088, August 2001; "Binaural processing model based on contralateral inhibition. II. Dependence on spectral parameters", J. Acoust. Soc. Am., 110, 1089-1104, August 2001; and "Binaural processing model based on contralateral inhibition. III. Dependence on temporal parameters", J. Acoust. Soc. Am., 110, 1105-1117, August 2001. This model comprises splitting an input audio signal into several band-limited signals, which are spaced linearly on an Equivalent Rectangular Bandwidth (ERB) rate scale. The bandwidth of these signals depends on the center frequency, following the ERB rate. Subsequently, for every frequency band, the following properties of the incoming signals are analyzed:

the interaural level difference (ILD), defined by the relative levels of the band-limited signals stemming from the left and right ears,

the interaural time (or phase) difference (ITD or IPD), defined by the interaural delay (or phase shift) corresponding to the peak in the interaural cross-correlation function, and

the (dis)similarity of the waveforms that cannot be accounted for by ITDs or ILDs, which can be parameterized by the maximum interaural cross-correlation (i.e., the value of the cross-correlation at the position of the maximum peak).

It is therefore known from the above disclosures that spatial attributes of any multi-channel audio signal may be described by specifying the ILD, ITD (or IPD) and maximum correlation as a function of time and frequency.

This parametric coding technique provides reasonably good quality for general audio signals. However, particularly for signals exhibiting highly non-stationary behaviour, e.g. castanets, harpsichord, glockenspiel, etc., the technique suffers from pre-echo artifacts.

It is an object of this invention to provide an audio coder and decoder, and corresponding methods, that mitigate the artifacts related to parametric multi-channel coding.

DISCLOSURE OF THE PRESENT INVENTION

According to the present invention there is provided a method of coding an audio signal and a method of decoding a bitstream.

According to an aspect of the invention, spatial attributes of multi-channel audio signals are parameterized. Preferably, the spatial attributes comprise level differences, temporal differences and correlations between the left and right signals.

Using the invention, transient positions are extracted, either directly or indirectly, from a monaural signal and are linked to parametric multi-channel representation layers. Utilizing this transient information in a parametric multi-channel layer provides increased performance.

It is acknowledged that in many audio coders, transient information is used to guide the coding process for better performance. For example, in the sinusoidal coder described in WO01/69593-A1, transient positions are encoded in the bitstream. The coder may use these transient positions for adaptive segmentation (adaptive framing) of the bitstream. Also, in the decoder, these positions may be used to guide the windowing for the sinusoidal and noise synthesis. However, these techniques have been limited to monaural signals.

In a preferred embodiment of the present invention, when decoding a bitstream whose monaural content has been produced by such a sinusoidal coder, the transient positions can be derived directly from the bit-stream.

In waveform coders, such as mp3 and AAC, transient positions are not directly encoded in the bitstream. Rather, in the case of mp3, for example, transient intervals are marked by switching to shorter window lengths (window switching) in the monaural layer, so transient positions can be estimated from parameters such as the mp3 window-switching flag.

BRIEF DESCRIPTION OF THE DRAWINGS

Preferred embodiments of the present invention will now be described, by way of example, with reference to the accompanying drawings, in which:

FIG. 1 is a schematic diagram illustrating an encoder according to an embodiment of the invention;

FIG. 2 is a schematic diagram illustrating a decoder according to an embodiment of the invention;

FIG. 3 shows transient positions encoded in respective sub-frames of a monaural signal and the corresponding frames of a multi-channel layer; and

FIG. 4 shows an example of the exploitation of the transient position from the monaural encoded layer for decoding a parametric multi-channel layer.

DESCRIPTION OF THE PREFERRED EMBODIMENT

Referring now to FIG. 1, there is shown an encoder 10 according to a preferred embodiment of the present invention for encoding a stereo audio signal comprising left (L) and right (R) input signals. In the preferred embodiment, as in European Patent Application No. 02076588.9 filed April, 2002, the encoder describes a multi-channel audio signal with:

- one monaural signal 12, comprising a combination of the multiple input audio signals, and
- for each additional auditory channel, a set of spatial parameters 14 comprising: two localization cues (ILD, and ITD or IPD) and a parameter (r) that describes the similarity or dissimilarity of the waveforms that cannot be accounted for by ILDs and/or ITDs (e.g., the maximum of the cross-correlation function), preferably for every time/frequency slot.

The set(s) of spatial parameters can be used as an enhancement layer by audio coders. For example, a mono signal is transmitted if only a low bit-rate is allowed, while by including the spatial enhancement layer(s), a decoder can reproduce stereo or multi-channel sound.

It will be seen that while in this embodiment a set of spatial parameters is combined with a monaural (single-channel) audio coder to encode a stereo audio signal, the general idea can be applied to n-channel audio signals, with n>1. Thus, the invention can in principle be used to generate n channels from one mono signal, if (n−1) sets of spatial parameters are transmitted. In such cases, the spatial parameters describe how to form the n different audio channels from the single mono signal. Thus, in a decoder, by combining a subsequent set of spatial parameters with the monaural coded signal, a subsequent channel is obtained.

Analysis Methods

In general, the encoder 10 comprises respective transform modules 20 which split each incoming signal (L, R) into sub-band signals 16 (preferably with a bandwidth which increases with frequency). In the preferred embodiment, the modules 20 use time-windowing followed by a transform operation to perform time/frequency slicing; however, time-continuous methods could also be used (e.g., filterbanks).

The next steps for determination of the sum signal 12 and extraction of the parameters 14 are carried out within an analysis module 18 and comprise:

finding the level difference (ILD) of corresponding sub-band signals 16,

finding the time difference (ITD or IPD) of corresponding sub-band signals 16, and

describing the amount of similarity or dissimilarity of the waveforms which cannot be accounted for by ILDs or ITDs.

Analysis of ILDs

The ILD is determined by the level difference of the signals at a certain time instance for a given frequency band. One method to determine the ILD is to measure the rms value of the corresponding frequency band of both input channels and compute the ratio of these rms values (preferably expressed in dB).
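By way of illustration only, a minimal sketch of this rms-ratio measure might read as follows (Python/NumPy; the function and variable names are ours, not from the disclosure):

    import numpy as np

    def ild_db(left_band, right_band, eps=1e-12):
        # Level difference (in dB) of corresponding subband signals for
        # one analysis frame; eps guards against silent bands.
        rms_l = np.sqrt(np.mean(left_band ** 2))
        rms_r = np.sqrt(np.mean(right_band ** 2))
        return 20.0 * np.log10((rms_l + eps) / (rms_r + eps))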

Analysis of the ITDs

The ITDs are determined by the time or phase alignment which gives the best match between the waveforms of both channels. One method to obtain the ITD is to compute the cross-correlation function between two corresponding subband signals and to search for the maximum. The delay that corresponds to this maximum in the cross-correlation function can be used as the ITD value.
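A minimal sketch of such a peak search, assuming time-domain subband signals and using a circular shift as a stand-in for a true delay (names ours):

    import numpy as np

    def itd_by_xcorr(left_band, right_band, max_lag=64):
        # Scan the normalized cross-correlation over a range of lags;
        # the lag at the maximum is taken as the ITD. The peak value
        # itself can serve as the correlation parameter discussed below.
        norm = np.sqrt(np.sum(left_band ** 2) * np.sum(right_band ** 2)) + 1e-12
        best_lag, best_val = 0, -np.inf
        for lag in range(-max_lag, max_lag):
            # np.roll is a circular shift, a simplification of a true
            # (zero-padded) delay.
            val = np.dot(left_band, np.roll(right_band, lag)) / norm
            if val > best_val:
                best_lag, best_val = lag, val
        return best_lag, best_val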

A second method is to compute the analytic signals of the left and right subbands (i.e., computing phase and envelope values) and use the phase difference between the channels as the IPD parameter. Here, a complex filterbank (e.g., an FFT) is used, and by looking at a certain bin (frequency region) a phase function can be derived over time. By doing this for both the left and right channel, the phase difference IPD (rather than cross-correlating two filtered signals) can be estimated.
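For complex filterbank outputs, the per-bin phase difference reduces to a one-line sketch (assumed names):

    import numpy as np

    def ipd(left_bin, right_bin):
        # Phase of the left minus phase of the right channel for one
        # complex filterbank bin, wrapped to (-pi, pi].
        return np.angle(left_bin * np.conj(right_bin))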

Analysis of the Correlation

The correlation is obtained by first finding the ILD and ITD that give the best match between the corresponding subband signals and subsequently measuring the similarity of the waveforms after compensation for the ITD and/or ILD. Thus, in this framework, the correlation is defined as the similarity or dissimilarity of corresponding subband signals which cannot be attributed to ILDs and/or ITDs. A suitable measure for this parameter is the maximum value of the cross-correlation function (i.e., the maximum across a set of delays). However, other measures could also be used, such as the relative energy of the difference signal after ILD and/or ITD compensation compared to the sum signal of corresponding subbands (preferably also compensated for ILDs and/or ITDs). This difference parameter is basically a linear transformation of the (maximum) correlation.
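The maximum of the normalized cross-correlation is exactly the peak value returned by the itd_by_xcorr sketch above. Under the same assumptions, the alternative difference-energy measure could be sketched as:

    import numpy as np

    def dissimilarity_from_difference(left_c, right_c):
        # Relative energy of the difference versus the sum of ILD/ITD-
        # compensated subband signals; a simple transform of the
        # maximum correlation.
        e_diff = np.sum((left_c - right_c) ** 2)
        e_sum = np.sum((left_c + right_c) ** 2) + 1e-12
        return e_diff / e_sum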

Parameter Quantization

An important issue in the transmission of parameters is the accuracy of the parameter representation (i.e., the size of quantization errors), which is directly related to the necessary transmission capacity and the audio quality. In this section, several issues with respect to the quantization of the spatial parameters will be discussed. The basic idea is to base the quantization errors on so-called just-noticeable differences (JNDs) of the spatial cues. To be more specific, the quantization error is determined by the sensitivity of the human auditory system to changes in the parameters. Since it is well known that the sensitivity to changes in the parameters strongly depends on the values of the parameters themselves, the following methods are applied to determine the discrete quantization steps.

Quantization of ILDs

It is known from psychoacoustic research that the sensitivity to changes in the ILD depends on the ILD itself. If the ILD is expressed in dB, deviations of approximately 1 dB from a reference of 0 dB are detectable, while changes in the order of 3 dB are required if the reference level difference amounts to 20 dB. Therefore, quantization errors can be larger if the signals of the left and right channels have a larger level difference. For example, this can be applied by first measuring the level difference between the channels, followed by a non-linear (compressive) transformation of the obtained level difference and subsequently a linear quantization process, or by using a lookup table for the available ILD values which have a non-linear distribution. In the preferred embodiment, ILDs (in dB) are quantized to the closest value out of the following set I:

I = [−19 −16 −13 −10 −8 −6 −4 −2 0 2 4 6 8 10 13 16 19]
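A minimal nearest-value quantizer for such a non-uniform set might be sketched as follows (helper name ours; the same routine serves the correlation ensemble R below):

    import numpy as np

    I = np.array([-19, -16, -13, -10, -8, -6, -4, -2,
                  0, 2, 4, 6, 8, 10, 13, 16, 19], dtype=float)

    def quantize_to_set(value, levels):
        # Return the index to transmit and the decoded (quantized) value.
        idx = int(np.argmin(np.abs(levels - value)))
        return idx, float(levels[idx])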

Quantization of the ITDs

The sensitivity to changes in the ITDs of human subjects can be characterized as having a constant phase threshold. This means that, in terms of delay times, the quantization steps for the ITD should decrease with frequency. Alternatively, if the ITD is represented in the form of phase differences, the quantization steps should be independent of frequency. One method to implement this would be to take a fixed phase difference as quantization step and determine the corresponding time delay for each frequency band. This ITD value is then used as the quantization step. In the preferred embodiment, ITD quantization steps are determined by a constant phase difference in each subband of 0.1 radians (rad). Thus, for each subband, the time difference that corresponds to 0.1 rad of the subband center frequency is used as the quantization step. For frequencies above 2 kHz, no ITD information is transmitted.
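A sketch of the resulting frequency-dependent step size (names ours); at 500 Hz, for example, 0.1 rad corresponds to roughly 32 microseconds:

    import numpy as np

    def itd_quant_step_s(center_freq_hz, phase_step_rad=0.1):
        # Time-difference quantization step (in seconds) corresponding
        # to a fixed 0.1 rad phase step at the subband center frequency.
        return phase_step_rad / (2.0 * np.pi * center_freq_hz)

    # e.g. itd_quant_step_s(500.0) ~= 3.2e-05 s, i.e. about 32 microseconds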

Another method would be to transmit phase differences which follow a frequency-independent quantization scheme. It is also known that above a certain frequency, the human auditory system is not sensitive to ITDs in the fine-structure waveforms. This phenomenon can be exploited by only transmitting ITD parameters up to a certain frequency (typically 2 kHz).

A third method of bitstream reduction is to incorporate ITD quantization steps that depend on the ILD and/or the correlation parameters of the same subband. For large ILDs, the ITDs can be coded less accurately. Furthermore, if the correlation is very low, it is known that the human sensitivity to changes in the ITD is reduced. Hence, larger ITD quantization errors may be applied if the correlation is small. An extreme example of this idea is to not transmit ITDs at all if the correlation is below a certain threshold.

Quantization of the Correlation

The quantization error of the correlation depends on (1) the correlation value itself and possibly (2) on the ILD. Correlation values near +1 are coded with a high accuracy (i.e., a small quantization step), while correlation values near 0 are coded with a low accuracy (a large quantization step). In the preferred embodiment, a set of non-linearly distributed correlation values (r) is quantized to the closest value of the following ensemble R:

R = [1 0.95 0.9 0.82 0.75 0.6 0.3 0]

This costs another 3 bits per correlation value.
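For instance, with the hypothetical quantize_to_set helper sketched earlier, a measured correlation of r = 0.8 maps to the nearest ensemble entry:

    import numpy as np

    R = np.array([1, 0.95, 0.9, 0.82, 0.75, 0.6, 0.3, 0], dtype=float)
    idx, r_q = quantize_to_set(0.8, R)   # idx == 3, r_q == 0.82 (3 bits)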

If the absolute value of the (quantized) ILD of the current subband amounts to 19 dB, no ITD and correlation values are transmitted for this subband. If the (quantized) correlation value of a certain subband amounts to zero, no ITD value is transmitted for that subband.

In this way, each frame requires a maximum of 233 bits to transmit the spatial parameters. With an update framelength of 1024 samples and a sampling rate of 44.1 kHz, the maximum bitrate for transmission amounts to less than 10.25 kbit/s (233 × 44100/1024 ≈ 10.03 kbit/s). (It should be noted that using entropy coding or differential coding, this bitrate can be reduced further.)

A second possibility is to use quantization steps for the correlation that depend on the measured ILD of the same subband: for large ILDs (i.e., when one channel is dominant in terms of energy), the quantization errors in the correlation become larger. An extreme example of this principle would be to not transmit correlation values for a certain subband at all if the absolute value of the ILD for that subband is beyond a certain threshold.

Detailed Implementation

In more detail, in the modules 20, the left and right incoming signals are split up into various time frames (2048 samples at 44.1 kHz sampling rate) and windowed with a square-root Hanning window. Subsequently, FFTs are computed. The negative FFT frequencies are discarded and the resulting FFTs are subdivided into groups or subbands 16 of FFT bins. The number of FFT bins that are combined in a subband g depends on the frequency: at higher frequencies more bins are combined than at lower frequencies. In the current implementation, FFT bins corresponding to approximately 1.8 ERBs are grouped, resulting in 20 subbands to represent the entire audible frequency range. The resulting number of FFT bins S[g] of each subsequent subband (starting at the lowest frequency) is

S = [4 4 4 5 6 8 9 12 13 17 21 25 30 38 45 55 68 82 100 477]
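A sketch of this framing and bin grouping, under the stated parameters (function names ours):

    import numpy as np

    S = [4, 4, 4, 5, 6, 8, 9, 12, 13, 17,
         21, 25, 30, 38, 45, 55, 68, 82, 100, 477]

    def analysis_fft(frame):
        # Square-root Hanning window on a 2048-sample frame, then keep
        # only the positive FFT frequencies.
        win = np.sqrt(np.hanning(len(frame)))
        return np.fft.fft(frame * win)[: len(frame) // 2]

    def subband_slices(sizes=S):
        # Slice objects selecting each subband's group of FFT bins.
        edges = np.cumsum([0] + list(sizes))
        return [slice(int(a), int(b)) for a, b in zip(edges[:-1], edges[1:])]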

Thus, the first three subbands contain 4 FFT bins, the fourth subband contains 5 FFT bins, etc. For each subband, the analysis module 18 computes the corresponding ILD, ITD and correlation (r). The ITD and correlation are computed simply by setting all FFT bins which belong to other groups to zero, multiplying the resulting (band-limited) FFTs from the left and right channels, followed by an inverse FFT transform. The resulting cross-correlation function is scanned for a peak within an interchannel delay between −64 and +63 samples. The delay corresponding to the peak is used as the ITD value, and the value of the cross-correlation function at this peak is used as this subband's interaural correlation. Finally, the ILD is simply computed by taking the power ratio of the left and right channels for each subband.
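A sketch of this per-subband computation (names ours; multiplying the left spectrum by the conjugate of the right one is the usual way to realize the multiplication mentioned above, and the energy normalization of the peak is our simplification):

    import numpy as np

    def band_parameters(spec_l, spec_r, band, nfft=2048, max_lag=64):
        # Zero all FFT bins outside this subband.
        xl = np.zeros(nfft, dtype=complex); xl[band] = spec_l[band]
        xr = np.zeros(nfft, dtype=complex); xr[band] = spec_r[band]
        # Band-limited cross-correlation via the inverse FFT of the
        # cross-spectrum (left times conjugated right).
        xcorr = np.fft.ifft(xl * np.conj(xr)).real
        lags = np.arange(-max_lag, max_lag)          # -64 .. +63 samples
        values = xcorr[lags % nfft]
        peak = int(np.argmax(values))
        itd = int(lags[peak])
        # Peak value normalized by the band energies (using Parseval).
        norm = np.sqrt(np.sum(np.abs(xl) ** 2) * np.sum(np.abs(xr) ** 2)) + 1e-12
        r = values[peak] * nfft / norm
        # ILD as the power ratio of left and right in this subband (dB).
        ild = 10.0 * np.log10((np.sum(np.abs(spec_l[band]) ** 2) + 1e-12)
                              / (np.sum(np.abs(spec_r[band]) ** 2) + 1e-12))
        return ild, itd, r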

Generation of the Sum Signal

The analyser 18 contains a sum signal generator 17 which performs a phase correction (temporal alignment) on the left and right subbands before summing the signals. This phase correction follows from the computed ITD for that subband and comprises delaying the left-channel subband by ITD/2 and the right-channel subband by −ITD/2. The delay is performed in the frequency domain by appropriate modification of the phase angles of each FFT bin. Subsequently, a summed signal is computed by adding the phase-modified versions of the left and right subband signals. Finally, to compensate for uncorrelated or correlated addition, each subband of the summed signal is multiplied by sqrt(2/(1+r)), with correlation (r) of the corresponding subband, to generate the final sum signal 12. If necessary, the sum signal can be converted to the time domain by (1) inserting complex conjugates at negative frequencies, (2) inverse FFT, (3) windowing, and (4) overlap-add.
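A sketch of this alignment and summation for one subband (names ours; bin_freqs_hz holds the frequency of each FFT bin in the subband, and a delay of t seconds multiplies a bin at frequency f by exp(-j*2*pi*f*t)):

    import numpy as np

    def sum_subband(spec_l_band, spec_r_band, bin_freqs_hz, itd_s, r):
        # Delay left by +ITD/2 and right by -ITD/2 via phase rotation.
        phase = 2.0 * np.pi * np.asarray(bin_freqs_hz) * (itd_s / 2.0)
        l_aligned = spec_l_band * np.exp(-1j * phase)
        r_aligned = spec_r_band * np.exp(+1j * phase)
        # Compensate for correlated/uncorrelated addition.
        return (l_aligned + r_aligned) * np.sqrt(2.0 / (1.0 + r))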

Given the representation of the sum signal 12 in the time and/or frequency domain as described above, the signal can be encoded in a monaural layer 40 of a bitstream 50 in any number of conventional ways. For example, an mp3 encoder can be used to generate the monaural layer 40 of the bitstream. When such an encoder detects rapid changes in an input signal, it can change the window length it employs for that particular time period so as to improve time and/or frequency localization when encoding that portion of the input signal. A window-switching flag is then embedded in the bitstream to indicate this switch to a decoder which later synthesizes the signal. For the purposes of the present invention, this window-switching flag is used as an estimate of a transient position in an input signal.

In the preferred embodiment, however, a sinusoidal coder 30 of the type described in WO01/69593-A1 is used to generate the monaural layer 40. The coder 30 comprises a transient coder 11, a sinusoidal coder 13 and a noise coder 15.

When the signal 12 enters the transient coder 11, for each update interval the coder estimates whether there is a transient signal component and, if so, its position (to sample accuracy) within the analysis window. If the position of a transient signal component is determined, the coder 11 tries to extract (the main part of) the transient signal component. It matches a shape function to a signal segment, preferably starting at an estimated start position, and determines the content underneath the shape function, by employing, for example, a (small) number of sinusoidal components, and this information is contained in the transient code CT.

The sum signal 12 less the transient component is furnished to the sinusoidal coder 13, where it is analyzed to determine the (deterministic) sinusoidal components. In brief, the sinusoidal coder encodes the input signal as tracks of sinusoidal components linked from one frame segment to the next. The tracks are initially represented by a start frequency, a start amplitude and a start phase for a sinusoid beginning in a given segment (a birth). Thereafter, the track is represented in subsequent segments by frequency differences, amplitude differences and, possibly, phase differences (continuations) until the segment in which the track ends (death), and this information is contained in the sinusoidal code CS.

The signal less both the transient and sinusoidal components is assumed to mainly comprise noise, and the noise analyzer 15 of the preferred embodiment produces a noise code CN representative of this noise. Conventionally, as in, for example, WO 01/89086-A1, a spectrum of the noise is modeled by the noise coder with combined AR (auto-regressive) and MA (moving-average) filter parameters (pi, qi) according to an Equivalent Rectangular Bandwidth (ERB) scale. Within a decoder, the filter parameters are fed to a noise synthesizer, which is mainly a filter having a frequency response approximating the spectrum of the noise. The synthesizer generates reconstructed noise by filtering a white noise signal with the ARMA filter parameters (pi, qi) and subsequently adds this to the synthesized transient and sinusoid signals to generate an estimate of the original sum signal.

The multiplexer 41 produces the monaural audio layer 40, which is divided into frames 42 which represent overlapping time segments of length 16 ms and which are updated every 8 ms, FIG. 4. Each frame includes respective codes CT, CS and CN, and in a decoder the codes for successive frames are blended in their overlap regions when synthesizing the monaural sum signal. In the present embodiment, it is assumed that each frame may include at most one transient code CT, and an example of such a transient is indicated by the numeral 44.

Generation of the Sets of Spatial Parameters

The analyser 18 further comprises a spatial parameter layer generator 19. This component performs the quantization of the spatial parameters for each spatial parameter frame as described above. In general, the generator 19 divides each spatial layer channel 14 into frames 46 which represent overlapping time segments of length 64 ms and which are updated every 32 ms, FIG. 4. Each frame includes respective ILD, ITD or IPD and correlation coefficients, and in the decoder the values for successive frames are blended in their overlap regions to determine the spatial layer parameters for any given time when synthesizing the signal.

In the preferred embodiment, transient positions detected by the transient coder 11 in the monaural layer 40 (or by a corresponding analyser module in the summed signal 12) are used by the generator 19 to determine whether non-uniform time segmentation in the spatial parameter layer(s) 14 is required. If the encoder is using an mp3 coder to generate the monaural layer, then the presence of a window-switching flag in the monaural stream is used by the generator as an estimate of a transient position.

Referring to FIG. 4, the generator 19 may receive an indication that a transient 44 needs to be encoded in one of the subsequent frames of the monaural layer corresponding to the time window of the spatial parameter layer(s) for which it is about to generate frame(s). It will be seen that, because each spatial parameter layer comprises frames representing overlapping time segments, for any given time the generator will be producing two frames per spatial parameter layer. In any case, the generator proceeds to generate spatial parameters for a frame representing a shorter-length window 48 around the transient position. It should be noted that this frame will be of the same format as normal spatial parameter layer frames and calculated in the same manner, except that it relates to a shorter time window around the transient position 44. This short-window frame provides increased time resolution for the multi-channel image. The frame(s) which would otherwise have been generated before and after the transient window frame are then used to represent special transition windows 47, 49 connecting the short transient window 48 to the windows 46 represented by normal frames.

In the preferred embodiment, the frame representing the transient window 48 is an additional frame in the spatial representation layer bitstream 14; however, because transients occur so infrequently, it adds little to the overall bitrate. It is nonetheless critical that a decoder reading a bitstream produced using the preferred embodiment takes this additional frame into account, as otherwise the synchronization of the monaural and the spatial representation layers would be compromised.

It is also assumed in the present embodiment, because transients occur so infrequently, that only one transient within the window length of a normal frame 46 may be relevant to the spatial parameter layer(s) representation. Even if two transients do occur during the period of a normal frame, it is assumed that the non-uniform segmentation will occur around the first transient, as indicated in FIG. 3. Here, three transients 44 are shown encoded in respective monaural frames. However, it is the second rather than the third transient which will be used to indicate that the spatial parameter layer frame representing the same time period (shown below these transients) should be used as a first transition window, prior to the transient window derived from an additional spatial parameter layer frame inserted by the encoder, in turn followed by a frame which represents a second transition window.

Nonetheless, it is possible that not all transient positions encoded in the monaural layer will be relevant for the spatial parameter layer(s), as is the case for the first transient 44 in FIG. 3. Thus, the bit-stream syntax for either the monaural or the spatial representation layer can include indicators of which transient positions are or are not relevant for the spatial representation layer.

In the preferred embodiment, it is the generator 19 which determines the relevance of a transient for the spatial representation layer, by looking at the difference between the estimated spatial parameters (ILD, ITD and correlation (r)) derived from a larger window (e.g., 1024 samples) that surrounds the transient location 44 and those derived from the shorter window 48 around the transient location. If there is a significant change between the parameters from the short and coarse time intervals, then the extra spatial parameters estimated around the transient location are inserted in an additional frame representing the short time window 48. If there is little difference, the transient location is not selected for use in the spatial representation and an indication is included in the bitstream accordingly.
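This decision could be sketched as follows (names and thresholds ours; the disclosure does not specify how "significant" is measured):

    def transient_relevant(params_long, params_short, thresholds):
        # params_long / params_short: dicts keyed e.g. by 'ild', 'itd'
        # and 'r', estimated over the long surrounding window and over
        # the short window 48 respectively; thresholds defines what
        # counts as a significant change per parameter.
        return any(abs(params_long[k] - params_short[k]) > thresholds[k]
                   for k in thresholds)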

Finally, once the monaural 40 and spatial representation 14 layers have been generated, they are in turn written by a multiplexer 43 to a bitstream 50. This audio stream 50 is in turn furnished to, e.g., a data bus, an antenna system, a storage medium, etc.

Synthesis

Referring now to FIG. 2, a decoder 60 includes a de-multiplexer 62 which splits an incoming audio stream 50 into the monaural layer 40′ and, in this case, a single spatial representation layer 14′. The monaural layer 40′ is read by a conventional synthesizer 64, corresponding to the encoder which generated the layer, to provide a time-domain estimation of the original summed signal 12′.

Spatial parameters 14′ extracted by the de-multiplexer 62 are then applied by a post-processing module 66 to the sum signal 12′ to generate left and right output signals. The post-processing module of the preferred embodiment also reads the monaural layer 40′ information to locate the positions of transients in this signal. (Alternatively, the synthesizer 64 could provide such an indication to the post-processor; however, this would require some slight modification of the otherwise conventional synthesizer 64.)

In any case, when the post-processor detects a transient 44 within a monaural layer frame 42 corresponding to the normal time window of the frame of the spatial parameter layer(s) 14′ which it is about to process, it knows that this frame represents a transition window 47 prior to a short transient window 48. The post-processor knows the time location of the transient 44 and so knows the length of the transition window 47 prior to the transient window, and also that of the transition window 49 after the transient window 48. In the preferred embodiment, the post-processor 66 includes a blending module 68 which, for the first portion of the window 47, mixes the parameters for the window 47 with those of the previous frame in synthesizing the spatial representation layer(s). From then until the beginning of the transient window 48, only the parameters for the frame representing the window 47 are used in synthesizing the spatial representation layer(s). For the first portion of the transient window 48, the parameters of the transition window 47 and the transient window 48 are blended; for the second portion of the transient window 48, the parameters of the transition window 49 and the transient window 48 are blended; and so on until the middle of the transition window 49, after which inter-frame blending continues as normal.
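How two frames are mixed in an overlap region is not specified beyond "blended"; a linear crossfade is one plausible reading (assumption ours):

    def blend(p_a, p_b, t):
        # Crossfade between the parameter values of two overlapping
        # frames; t runs from 0 (frame a only) to 1 (frame b only).
        # The linear weighting is our assumption.
        return (1.0 - t) * p_a + t * p_b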

As explained above, the spatial parameters used at any given time are a blend of the parameters for two normal window 46 frames, a blend of the parameters for a normal window 46 and a transition frame 47, 49, those of a transition window frame 47, 49 alone, or a blend of those of a transition window frame 47, 49 and those of a transient window frame 48. Using the syntax of the spatial representation layer, the module 68 can select those transients which indicate non-uniform time segmentation of the spatial representation layer, and at these appropriate transient locations the short-length transient windows provide better time localisation of the multi-channel image.

Within the post-processor 66, it is assumed that a frequency-domain representation of the sum signal 12′, as described in the analysis section, is available for processing. This representation may be obtained by windowing and FFT operations on the time-domain waveform generated by the synthesizer 64. Then, the sum signal is copied to left and right output signal paths. Subsequently, the correlation between the left and right signals is modified with a decorrelator 69′, 69″ using the parameter r. For a detailed description of how this can be implemented, reference is made to the European patent application titled "Signal synthesizing", filed on 12 Jul. 2002, of which D. J. Breebaart is the first inventor (our reference PHNL020639). That European patent application discloses a method of synthesizing a first and a second output signal from an input signal, which method comprises filtering the input signal to generate a filtered signal, obtaining the correlation parameter, obtaining a level parameter indicative of a desired level difference between the first and the second output signals, and transforming the input signal and the filtered signal by a matrixing operation into the first and second output signals, where the matrixing operation depends on the correlation parameter and the level parameter. Subsequently, in respective stages 70′, 70″, each subband of the left signal is delayed by −ITD/2, and the right signal is delayed by ITD/2, given the (quantized) ITD corresponding to that subband. Finally, the left and right subbands are scaled according to the ILD for that subband in respective stages 71′, 71″. Respective transform stages 72′, 72″ then convert the output signals to the time domain by performing the following steps: (1) inserting complex conjugates at negative frequencies, (2) inverse FFT, (3) windowing, and (4) overlap-add.
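The delay and scaling stages 70′/71′ might be sketched per subband as follows (names ours; the decorrelation stage 69′/69″ is omitted here, and splitting the ILD evenly across the two channels is our assumption, as the disclosure only says the subbands are scaled according to the ILD):

    import numpy as np

    def synthesize_subband(spec_sum_band, bin_freqs_hz, ild_db, itd_s):
        # Delay left by -ITD/2 and right by +ITD/2 (the inverse of the
        # alignment applied in the encoder).
        phase = 2.0 * np.pi * np.asarray(bin_freqs_hz) * (itd_s / 2.0)
        left = spec_sum_band * np.exp(+1j * phase)
        right = spec_sum_band * np.exp(-1j * phase)
        # Apply the level difference, split evenly across the channels:
        # g*g is the amplitude ratio 10**(ild_db/20).
        g = 10.0 ** (ild_db / 40.0)
        return left * g, right / g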

The preferred embodiments of decoder and encoder have been described in terms of producing a monaural signal which is a combination of two signals, primarily for the case in which only the monaural signal is used in a decoder. However, it should be seen that the invention is not limited to these embodiments, and the monaural signal can correspond with a single input and/or output channel, with the spatial parameter layer(s) being applied to respective copies of this channel to produce the additional channels.

It is observed that the present invention can be implemented in dedicated hardware, in software running on a DSP (Digital Signal Processor) or on a general-purpose computer. The present invention can be embodied in a tangible medium such as a CD-ROM or a DVD-ROM carrying a computer program for executing an encoding method according to the invention. The invention has particular application in the fields of Internet download, Internet radio, Solid State Audio (SSA), bandwidth extension schemes, for example mp3PRO and CT-aacPlus, and most audio coding schemes.

CLAIMS

1. A method of coding an audio signal, the method comprising the acts of: generating a monaural signal; analyzing the spatial characteristics of at least two audio channels to obtain one or more sets of spatial parameters for successive time slots; responsive to said monaural signal containing a transient at a given transient time, determining a non-uniform time segmentation of said sets of spatial parameters for a period including said transient time; determining a relevance of said transient by looking at a difference between first estimated spatial parameters derived from a first window that surrounds a transient location of said transient and second estimated spatial parameters derived from a second window around said transient location, the second window being shorter than the first window; generating an encoded signal comprising the monaural signal and the one or more sets of spatial parameters; and, if said difference is larger than a threshold, inserting in the encoded signal additional parameters estimated around said transient location.

2. The method according to claim 1, wherein said monaural signal comprises a combination of at least two input audio channels.

3. The method according to claim 1, wherein said monaural signal is generated with a parametric sinusoidal coder, said coder generating frames corresponding to successive time slots of said monaural signal, at least some of said frames including parameters representing a transient occurring in the respective time slots represented by said frames.

4. The method according to claim 1, wherein said monaural signal is generated with a waveform encoder, said waveform encoder determining a non-uniform time segmentation of said monaural signal for a period including said transient time.

5. The method according to claim 4, wherein said waveform encoder is an mp3 encoder.

6. The method according to claim 1, wherein said sets of spatial parameters include at least two localization cues.

7. The method according to claim 6, wherein said sets of spatial parameters further comprise a parameter that describes a similarity or dissimilarity of waveforms that cannot be accounted for by the localization cues.

8. The method according to claim 7, wherein the parameter is a maximum of a cross-correlation function.

9. The method of claim 1, wherein the additional parameters are inserted in an additional frame representing the second window around the transient location.

10. The method of claim 1, further comprising the act of including in the encoded signal an indication that the transient location is not selected for use in a spatial representation if the difference is below the threshold.

11. The method of claim 1, wherein the transient is a first transient in a frame containing a plurality of transients.

12. An encoder for coding an audio signal, the encoder comprising: a sum generator configured to generate a monaural signal; an analyzer configured to analyze spatial characteristics of at least two audio channels to obtain one or more sets of spatial parameters for successive time slots; a transient coder, responsive to said monaural signal containing a transient at a given transient time, configured to determine a non-uniform time segmentation of said sets of spatial parameters for a period including said transient time; a parameter generator configured to determine a relevance of said transient by looking at a difference between first estimated spatial parameters derived from a first window that surrounds a transient location of said transient and second estimated spatial parameters derived from a second window around said transient location, the second window being shorter than the first window; and a multiplexer configured to generate an encoded signal comprising the monaural signal and the one or more sets of spatial parameters; wherein the parameter generator is further configured to insert in the encoded signal additional parameters estimated around said transient location if said difference is larger than a threshold.

13. An apparatus for supplying an audio signal, the apparatus comprising: an input for receiving an audio signal; an encoder as claimed in claim 12 for encoding the audio signal to obtain an encoded audio signal; and an output for supplying the encoded audio signal.

14. A storage medium on which an encoded signal has been stored, the signal comprising: a monaural signal containing at least one indication of a transient occurring at a given time in said monaural signal; and one or more sets of spatial parameters for successive time slots of said signal, said sets of spatial parameters providing a non-uniform time segmentation of the audio signal for a period including said transient time; wherein the one or more sets of spatial parameters is indicative of a difference being larger than a threshold, the difference being between first estimated spatial parameters derived from a first window that surrounds a transient location of said transient and second estimated spatial parameters derived from a second window around said transient location, the second window being shorter than the first window.

15. A method of decoding an encoded audio signal, the method comprising: obtaining a monaural signal from the encoded audio signal; obtaining one or more sets of spatial parameters from the encoded audio signal; responsive to said monaural signal containing a transient at a given time, determining a non-uniform time segmentation of said sets of spatial parameters for a period including said transient time; and applying the one or more sets of spatial parameters to the monaural signal to generate a multi-channel output signal; wherein the one or more sets of spatial parameters is indicative of a difference being larger than a threshold, the difference being between first estimated spatial parameters derived from a first window that surrounds a transient location of said transient and second estimated spatial parameters derived from a second window around said transient location, the second window being shorter than the first window.

16. A decoder for decoding an encoded audio signal, comprising: a de-multiplexer configured to obtain a monaural signal and one or more sets of spatial parameters from the encoded audio signal; and a post-processor, responsive to said monaural signal containing a transient at a given time, configured to determine a non-uniform time segmentation of said sets of spatial parameters for a period including said transient time, the post-processor being further configured to apply the one or more sets of spatial parameters to the monaural signal to generate a multi-channel output signal; wherein the one or more sets of spatial parameters is indicative of a difference being larger than a threshold, the difference being between first estimated spatial parameters derived from a first window that surrounds a transient location of said transient and second estimated spatial parameters derived from a second window around said transient location, the second window being shorter than the first window.

17. An apparatus for supplying a decoded audio signal, the apparatus comprising: an input for receiving an encoded audio signal; a decoder as claimed in claim 16 for decoding the encoded audio signal to obtain a multi-channel output signal; and an output for supplying or reproducing the multi-channel output signal.